activity-3.Rmd in your in-class directoryoutput: github_documenttidyverse and babynamesThe output of r chunks can be cached (saved) so that they don't need to be re-run everytime you re-knit.
#```{r compute-stuff, cache = TRUE}
# my crazy intensive computing step
#```
cyl4s <- mtcars[mtcars$cyl == 4, ] mean(cyl4s$mpg)
## [1] 26.66364
library(dplyr) cyl4s <- filter(mtcars, cyl == 4) summarize(cyl4s, mean = mean(mpg))
## mean ## 1 26.66364
cyl4s <- mtcars[mtcars$cyl == 4, ]
[rows, columns]mean(cyl4s$mpg)
## [1] 26.66364
$.cyl4s <- filter(mtcars, cyl == 4)
dplyr contains separate functions to filter the rows and select the columns.$).cyl4s <- mtcars %>% filter(cyl == 4)
summarize(cyl4s, mean = mean(mpg))
## mean ## 1 26.66364
summarize().cyl4s %>% summarize(mean = mean(mpg))
## mean ## 1 26.66364
cyl4s <- mtcars[mtcars$cyl == 4, ] mean(cyl4s$mpg)
## [1] 26.66364
library(dplyr) cyl4s <- filter(mtcars, cyl == 4) summarize(cyl4s, mean = mean(mpg))
## mean ## 1 26.66364
Data sets are often of high volume (lots of rows) and high variety (lots of columns). This is overwhelming to visualize and analyze, so we find ourselves chopping the data set up into more manageable and meaningful chunks. We also often need to perform operations to organize and clean our data.
This is all possible in base R, but with dplyr, it is simple, readible, and fast.
select()filter()mutate()arrange()summarize()tbl_df%>%select()Subset the columns.
filter()Subset the rows.
mutate()Add or modify a column.
rename()Rename a column.
arrange()Sort the rows.
summarize()Summarize column(s) into a single row.
babynames that indicates whether or not the name proportion in that yearxgender was greater than 1%. Call that column is_popular.babynames that indicates whether or not the name proportion in that yearxgender was greater than 1%. Call that column is_popular.babynames %>% mutate(is_popular = prop > .01) %>% filter(is_popular == TRUE) %>% arrange(prop)
## # A tibble: 3,875 x 6 ## year sex name n prop is_popular ## <dbl> <chr> <chr> <int> <dbl> <lgl> ## 1 1894 F Grace 2361 0.0100 T ## 2 1969 F Karen 17637 0.0100 T ## 3 1983 F Crystal 17904 0.0100 T ## 4 2000 F Madison 19967 0.0100 T ## 5 1960 M Gary 21688 0.0100 T ## 6 1911 F Florence 4424 0.0100 T ## 7 1967 M Eric 17828 0.0100 T ## 8 1957 M Larry 21913 0.0100 T ## 9 2003 M Daniel 21032 0.0100 T ## 10 1934 M Frank 10635 0.0100 T ## # ... with 3,865 more rows
bella that contains all instances of babies named "Bella" and all of the variables except prop.class() is this object?bella that contains all instances of babies named "Bella" and all of the variables except prop.class() is this object?bella <- babynames %>% select(year, sex, name, n) %>% filter(name == "Bella")
or
bella <- babynames %>% select(-prop) %>% filter(name == "Bella") class(bella)
## [1] "tbl_df" "tbl" "data.frame"
babynames %>% select(year, sex, name, n) %>% filter(name == "Bella") %>% ggplot(aes(x = year, y = n)) + geom_line(aes(color = sex))
%>%) is provided by the magrittr package|) in UNIXThe expression
mydata %>% verb(arg1)
is the same as:
verb(mydata, arg1)
.Instead of having to read/write:
select(filter(mutate(data, arg1), arg2), arg3)
You can do:
data %>% mutate(arg1) %>% filter(arg2) %>% select(arg3)
Nested form:
bop(scoop(hop(foo_foo, through = forest), up = field_mice, on = head))
With pipes:
foo_foo %>% hop(through = forest) %>% scoop(up = field_mice) %>% bop(on = head)
More on that analogy: - https://github.com/hadley/r4ds/blob/master/pipes.Rmd
summarize() or graphics, find the year in which your name was used most frequently.summarize() or graphics, find the year in which your name was used most frequently.babynames %>% filter(name == "Andrew") %>% arrange(desc(n))
## # A tibble: 245 x 5 ## year sex name n prop ## <dbl> <chr> <chr> <int> <dbl> ## 1 1987 M Andrew 36199 0.0186 ## 2 1988 M Andrew 35727 0.0179 ## 3 1989 M Andrew 34809 0.0166 ## 4 1986 M Andrew 33853 0.0176 ## 5 1990 M Andrew 33653 0.0156 ## 6 1991 M Andrew 31525 0.0149 ## 7 1992 M Andrew 30534 0.0146 ## 8 1985 M Andrew 30160 0.0157 ## 9 1993 M Andrew 27314 0.0132 ## 10 1994 M Andrew 26006 0.0128 ## # ... with 235 more rows
summarize() or graphics, find the year in which your name was used most frequently.babynames %>% filter(year == 1987) %>% arrange(desc(n))
## # A tibble: 21,399 x 5 ## year sex name n prop ## <dbl> <chr> <chr> <int> <dbl> ## 1 1987 M Michael 63638 0.0327 ## 2 1987 F Jessica 55990 0.0299 ## 3 1987 F Ashley 54845 0.0293 ## 4 1987 M Christopher 54476 0.0279 ## 5 1987 M Matthew 46473 0.0238 ## 6 1987 F Amanda 41786 0.0223 ## 7 1987 M Joshua 39991 0.0205 ## 8 1987 M David 36776 0.0189 ## 9 1987 M Andrew 36199 0.0186 ## 10 1987 M Daniel 35857 0.0184 ## # ... with 21,389 more rows
group_by()Indicate sub-groups of rows in a data set.
group_by()Indicate sub-groups of rows in a data set.
Total number of names in each year
babynames %>% group_by(year) %>% summarize(count = n())
## # A tibble: 136 x 2 ## year count ## <dbl> <int> ## 1 1880 2000 ## 2 1881 1935 ## 3 1882 2127 ## 4 1883 2084 ## 5 1884 2297 ## 6 1885 2294 ## 7 1886 2392 ## 8 1887 2373 ## 9 1888 2651 ## 10 1889 2590 ## # ... with 126 more rows
babynames %>% group_by(year) %>% summarize(num_births = sum(n)) %>% arrange(desc(num_births))
## # A tibble: 136 x 2 ## year num_births ## <dbl> <int> ## 1 1957 4200146 ## 2 1959 4156617 ## 3 1960 4154877 ## 4 1961 4140040 ## 5 1958 4131784 ## 6 1956 4121274 ## 7 1962 4035565 ## 8 1955 4012757 ## 9 2007 3992280 ## 10 1954 3979147 ## # ... with 126 more rows
babynames %>% group_by(name) %>% summarize(earliest = min(year), latest = max(year))
## # A tibble: 95,025 x 3 ## name earliest latest ## <chr> <dbl> <dbl> ## 1 Aaban 2007 2015 ## 2 Aabha 2011 2015 ## 3 Aabid 2003 2003 ## 4 Aabriella 2008 2015 ## 5 Aada 2015 2015 ## 6 Aadam 1987 2015 ## 7 Aadan 2003 2015 ## 8 Aadarsh 2000 2015 ## 9 Aaden 2001 2015 ## 10 Aadesh 2005 2011 ## # ... with 95,015 more rows
Choose two of the following exercises to add to your activity and also please post a graphic to slack (the last bullet).
distinct()Return the rows with duplicate labels.
babynames %>% distinct(name)
## # A tibble: 95,025 x 1 ## name ## <chr> ## 1 Mary ## 2 Anna ## 3 Emma ## 4 Elizabeth ## 5 Minnie ## 6 Margaret ## 7 Ida ## 8 Alice ## 9 Bertha ## 10 Sarah ## # ... with 95,015 more rows
babynames %>%
filter(name == "Jackie") %>%
group_by(year) %>%
summarize(n(),
sum(n),
sum(ifelse(sex == "M", n, 0)))
What will this return?
babynames %>%
filter(name == "Jackie") %>%
group_by(year) %>%
summarize(n(),
sum(n),
sum(ifelse(sex == "M", n, 0)))
## # A tibble: 117 x 4 ## year `n()` `sum(n)` `sum(ifelse(sex == "M", n, 0))` ## <dbl> <int> <int> <dbl> ## 1 1899 1 9 0 ## 2 1900 1 12 0 ## 3 1901 1 7 0 ## 4 1902 1 6 0 ## 5 1903 1 12 0 ## 6 1904 2 25 5.00 ## 7 1905 1 9 0 ## 8 1906 1 14 0 ## 9 1907 1 15 0 ## 10 1908 2 28 5.00 ## # ... with 107 more rows
babynames %>%
filter(name == "Jackie") %>%
rename(n_births = n) %>%
group_by(year) %>%
summarize(number_of_rows = n(),
n_total_births = sum(n_births),
n_boy_births = sum(ifelse(sex == "M", n_births, 0)))
## # A tibble: 117 x 4 ## year number_of_rows n_total_births n_boy_births ## <dbl> <int> <int> <dbl> ## 1 1899 1 9 0 ## 2 1900 1 12 0 ## 3 1901 1 7 0 ## 4 1902 1 6 0 ## 5 1903 1 12 0 ## 6 1904 2 25 5.00 ## 7 1905 1 9 0 ## 8 1906 1 14 0 ## 9 1907 1 15 0 ## 10 1908 2 28 5.00 ## # ... with 107 more rows
Data that resides in multiple tables with relations that link observations between pairs of tables (often not a one-one relationship!).
Consider two tables with rows identified by their key.
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
x
## # A tibble: 3 x 2 ## key val_x ## <dbl> <chr> ## 1 1.00 x1 ## 2 2.00 x2 ## 3 3.00 x3
y
## # A tibble: 3 x 2 ## key val_y ## <dbl> <chr> ## 1 1.00 y1 ## 2 2.00 y2 ## 3 4.00 y3
An inner join returns the rows that have a key in both tables, with the combined columns from both tables.
x %>%
inner_join(y, by = c("key" = "key"))
## # A tibble: 2 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1.00 x1 y1 ## 2 2.00 x2 y2
A left join returns the rows that have a key in the left table, with the combined columns from both tables, filling in NAs where there are no key matches in the right table.
x %>%
left_join(y, by = c("key" = "key"))
## # A tibble: 3 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1.00 x1 y1 ## 2 2.00 x2 y2 ## 3 3.00 x3 <NA>
A right join returns the rows that have a key in the right table, with the combined columns from both tables, filling in NAs where there are no key matches in the left table.
x %>%
right_join(y, by = c("key" = "key"))
## # A tibble: 3 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1.00 x1 y1 ## 2 2.00 x2 y2 ## 3 4.00 <NA> y3
A full join returns the rows that have a key either table, with the combined columns from both tables, filling in NAs where there are no key matches in the corresponding table.
x %>%
full_join(y, by = c("key" = "key"))
## # A tibble: 4 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1.00 x1 y1 ## 2 2.00 x2 y2 ## 3 3.00 x3 <NA> ## 4 4.00 <NA> y3
Sometimes helpful but runs into problems when keys don't uniquely identify an observation.
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)
x
## # A tibble: 4 x 2 ## key val_x ## <dbl> <chr> ## 1 1.00 x1 ## 2 2.00 x2 ## 3 2.00 x3 ## 4 1.00 x4
y
## # A tibble: 2 x 2 ## key val_y ## <dbl> <chr> ## 1 1.00 y1 ## 2 2.00 y2
x %>%
inner_join(y, key = c("key", "key"))
## Joining, by = "key"
## # A tibble: 4 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1.00 x1 y1 ## 2 2.00 x2 y2 ## 3 2.00 x3 y2 ## 4 1.00 x4 y1
x %>%
right_join(y, key = c("key", "key"))
## Joining, by = "key"
## # A tibble: 4 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1.00 x1 y1 ## 2 1.00 x4 y1 ## 3 2.00 x2 y2 ## 4 2.00 x3 y2
x %>%
left_join(y, key = c("key", "key"))
## Joining, by = "key"
## # A tibble: 4 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1.00 x1 y1 ## 2 2.00 x2 y2 ## 3 2.00 x3 y2 ## 4 1.00 x4 y1
x %>%
full_join(y, key = c("key", "key"))
## Joining, by = "key"
## # A tibble: 4 x 3 ## key val_x val_y ## <dbl> <chr> <chr> ## 1 1.00 x1 y1 ## 2 2.00 x2 y2 ## 3 2.00 x3 y2 ## 4 1.00 x4 y1
In the babynames package there are actually several tables. Start by renaming births, which has data from the US Census.
births
## # A tibble: 119 x 2 ## year births ## <int> <int> ## 1 1909 2718000 ## 2 1910 2777000 ## 3 1911 2809000 ## 4 1912 2840000 ## 5 1913 2869000 ## 6 1914 2966000 ## 7 1915 2965000 ## 8 1916 2964000 ## 9 1917 2944000 ## 10 1918 2948000 ## # ... with 109 more rows
census_births <- births
The babynames data is from the social security administration. Let's subset (to make things interesting), then get it into the same form as census_births.
ssa_births <- babynames %>% filter(year < 2012) %>% group_by(year) %>% summarize(N = n(), births = sum(n))
inner_join(). What does this table represent? What time span is covered by this table?left_join(). What does this table represent? What time span is covered by this table?right_join(). What does this table represent? What time span is covered by this table?full_join(). What does this table represent? What time span is covered by this table?